Using the cavity method and diagrammatic methods, we model the dynamics of batch learning
of restricted sets of examples. Simulations of the Green's function and the cavity activation
distributions support the theory well. The learning dynamics approaches a steady state in agreement
with the static version of the cavity method. The picture of the rough energy landscape is reviewed.

I. INTRODUCTION

The mean-field theory was first developed as an approximation to many physical systems in magnetic or disordered
materials [1]. However, it is interesting that it becomes exact in many systems in information processing. The major
reason for its success is that, compared with physical systems, these artificial systems have extensive interactions
among their components. Hence when one component is considered, the influence of the rest of the system can be
regarded as a background satisfying some averaged properties.

Learning in large neural networks is a mean-field process since the examples and weights strongly interact with 
each other during the learning process. Learning is often achieved by defining an energy function which involves a 
training set of examples. The energy function is then minimized by a gradient descent process with respect to the 
weights until a steady state is reached. Each of the many weights is thus dependent on each of the many examples 
and vice versa. This makes it an ideal area for applying mean-field theories. 

There have been attempts to use mean-field theories to describe the dynamics of learning. In batch learning, the
same restricted set of examples is provided for each learning step. Using the dynamical mean-field theory, early work
has been done on the steady-state behavior and asymptotic time scales in perceptrons with binary weights, rather
than the continuous weights of more common interest. Much benchmarking of batch learning has been done for
linear learning rules such as Hebbian learning or Adaline learning. The work on Adaline learning was further
extended to the study of linear perceptrons learning nonlinear rules. However, not much work has been done on
the learning of nonlinear rules with continuous weights. In this respect, it is interesting to note the recent attempts
using the dynamical replica theory. It approximates the temporal correlations during learning by instantaneous
effective macroscopic variables. Further approximations facilitate results for nonlinear learning. However, the rigor
of these approximations remains to be confirmed in the general case.

Batch learning is different from idealized models of on-line learning of infinite training sets, in which much
progress has been made. In this model, an independent example is generated for each learning step. Since statistical
correlations among the examples can be ignored, the many-body interactions among the examples, and hence among
the weights, are absent. Hence these models do not address the many-body aspects of the dynamics, which will be discussed
here. Nevertheless, this simplification enables the dynamics to be simply described by instantaneous dynamical
variables, resulting in a significant reduction in the complexity of analysis, thereby leading to great advances in our
understanding of on-line learning. In multilayer perceptrons, for instance, the persistence of a permutation symmetric
stage which retards the learning process was well studied. Subsequent proposals to speed up learning were made,
illustrating the usefulness of the on-line approach.

Here we review models of batch learning [14,15] where, however, such simplifications are not available. Since the
same restricted set of examples is recycled during the learning process, there now exist temporal correlations of the
parameters in the learning history. Nevertheless, we manage to consider the learning model as a many-body system.
Each example makes a small contribution to the learning process, which can be described by linear response terms in
a sea of background examples. Two ingredients are important to our theory:

(a) The cavity method - Originally developed as the Thouless-Anderson-Palmer approach to magnetic systems and
spin glasses [16], the method was adopted to learning in perceptrons [17], and subsequently extended to the teacher-student
perceptron [18], the AND machine [19], the multiclass perceptron [20], the committee tree [21,22], Bayesian
learning [23] and pruned perceptrons [24]. These studies only considered the equilibrium properties of learning,
whereas here we are generalizing the method to study the dynamics. It uses a self-consistency argument to
compare the evolution of the activation of an example when it is absent or present in the training set. When absent,
the activation of the example is called the cavity activation, in contrast to its generic counterpart when it is included
in the training set.

The cavity method yields macroscopic properties identical to the more conventional replica method [16]. However,
since the replica method was originally devised as a technique to facilitate systemwide averages, it provides much less
information on the microscopic conditions of the individual dynamical variables.

(b) The diagrammatic approach - To describe the difference between the cavity activation and its generic counterpart
of an example, we apply linear response theory and use the Green's function to describe how the influence of the added
example propagates through the learning history. The Green's function is represented by a series of diagrams, whose
averages over examples are performed by a set of pairing rules similar to those introduced for Adaline learning,
as well as in the dynamics of layered networks [25]. Here we take a further step and use the diagrams to describe
the changes from cavity to generic activations, as was done previously, rather than the evolution of specific dynamical
variables in the case of linear rules. Hence our dynamical equations are widely applicable to any gradient-descent
learning rule which minimizes an arbitrary cost function in terms of the activation. The approach fully takes into account the
temporal correlations during learning, and is exact for large networks.

The study of learning dynamics should also provide further insights on the steady-state properties of learning. In
this respect we will review the cavity approach to the steady-state behavior of learning, in which the microscopic variables
satisfy a set of TAP equations. The approach is particularly transparent when the energy landscape is smooth, i.e.,
no local minima interfere with the approach to the steady state. However, the picture is valid only when a stability
condition (equivalent to the Almeida-Thouless condition in the replica method) is satisfied. Beyond this regime, local
minima begin to appear and the energy landscape is roughened. In this case, a similar set of TAP equations remains
valid. The physical picture has been presented previously; a more complete analysis is presented here.

The paper is organized as follows. In Section 2 we formulate the dynamics of batch learning. In Section 3 we 
introduce the cavity method and the dynamical equations for the macroscopic variables. In Section 4 we present 
simulation results which support the cavity theory. In Sections 5 and 6 we consider the steady-state behaviour 
of learning and generalize the TAP equations respectively to the pictures of smooth and rough energy landscapes, 
followed by a conclusion in Section 7. The appendices explain the diagrammatic approach in describing the Green's 
function, the fluctuation response relation, and the equations for macroscopic parameters in the picture of rough 
energy landscapes. 



II. FORMULATION 



Consider the single layer perceptron with $N \gg 1$ input nodes connecting to a single output node by the weights
$\{J_j\}$ and often, the bias $\theta$ as well. For convenience we assume that the inputs $\xi_j$ are Gaussian variables with mean 0
and variance 1, and the output state is a function $f(x)$ of the activation $x$ at the output node, where $x = \vec{J}\cdot\vec{\xi} + \theta$.

The training set consists of $p = \alpha N$ examples which map inputs $\{\xi_j^\mu\}$ to the outputs $\{S_\mu\}$ ($\mu = 1, \ldots, p$). In the
case of random examples, $S_\mu$ are random binary variables, and the perceptron is used as a storage device. In the case
of teacher-generated examples, $S_\mu$ are the outputs generated by a teacher perceptron with weights $\{B_j\}$ and often, a
bias $\phi$ as well, namely $S_\mu = f(y_\mu)$; $y_\mu = \vec{B}\cdot\vec{\xi}^\mu + \phi$.

Batch learning is achieved by adjusting the weights $\{J_j\}$ iteratively so that a certain cost function in terms of
the activations $\{x_\mu\}$ and the output $S_\mu$ of all examples is minimized. Hence we consider a general cost function
$E = -\sum_\mu g(x_\mu, y_\mu)$. The precise functional form of $g(x,y)$ depends on the adopted learning algorithm. In previous
studies, $g(x,y) = -(S - x)^2/2$ in Adaline learning, and $g(x,y) = xS$ in Hebbian learning.

To ensure that the perceptron fulfills the prior expectation of minimal complexity, it is customary to introduce a
weight decay term. In the presence of noise, the gradient descent dynamics of the weights is given by

$$\frac{dJ_j(t)}{dt} = \frac{1}{N}\sum_\mu g'(x_\mu(t), y_\mu)\,\xi_j^\mu - \lambda J_j(t) + \eta_j(t), \tag{1}$$

where the prime represents partial differentiation with respect to $x$, $\lambda$ is the weight decay strength, and $\eta_j(t)$ is the
noise term at temperature $T$ with

$$\langle\eta_j(t)\rangle = 0 \quad\text{and}\quad \langle\eta_j(t)\eta_k(s)\rangle = \frac{2T}{N}\,\delta_{jk}\,\delta(t-s). \tag{2}$$
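The weight dynamics above can be illustrated numerically. The following is a minimal sketch, not the authors' simulation code: it assumes an Euler-Maruyama discretization with a hypothetical step size `dt`, zero bias, the Adaline choice $g'(x) = S - x$, and a hypothetical sign-output teacher. The helper names `batch_step` and `objective` are introduced only for this illustration.

```python
import numpy as np

def adaline_g_prime(x, S):
    """g'(x) for g(x, y) = -(S - x)^2 / 2 (Adaline rule)."""
    return S - x

def batch_step(J, xi, S, lam, T, dt, rng, g_prime=adaline_g_prime):
    """One Euler-Maruyama step of the weight dynamics with zero bias (assumed discretization)."""
    N = J.size
    x = xi @ J                                   # activations x_mu = J . xi^mu
    drift = xi.T @ g_prime(x, S) / N - lam * J   # (1/N) sum_mu g'_mu xi_j^mu - lam * J_j
    # Noise increment with variance 2*T*dt/N per weight, matching <eta eta> = (2T/N) delta.
    noise = np.sqrt(2.0 * T * dt / N) * rng.standard_normal(N)
    return J + dt * drift + noise

def objective(J, xi, S, lam):
    """Quantity descended by the noiseless Adaline flow: E/N + lam * |J|^2 / 2."""
    x = xi @ J
    return 0.5 * np.sum((S - x) ** 2) / J.size + 0.5 * lam * (J @ J)

rng = np.random.default_rng(0)
N, alpha, lam, dt = 100, 1.2, 0.1, 0.01          # illustrative parameter values
xi = rng.standard_normal((int(alpha * N), N))    # Gaussian inputs, mean 0 and variance 1
B = rng.standard_normal(N)
B /= np.linalg.norm(B)                           # hypothetical teacher with B.B = 1
S = np.sign(xi @ B)                              # teacher-generated outputs (sign output assumed)

J = np.zeros(N)
history = [objective(J, xi, S, lam)]
for _ in range(500):
    J = batch_step(J, xi, S, lam, T=0.0, dt=dt, rng=rng)
    history.append(objective(J, xi, S, lam))
```

At $T = 0$ the recorded `history` should decrease monotonically for a sufficiently small `dt`, which is a quick sanity check on the sign conventions of the drift term.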

The dynamics of the bias $\theta$ is similar, except that no bias decay should be present according to consistency arguments,

$$\frac{d\theta(t)}{dt} = \frac{1}{N}\sum_\mu g'(x_\mu(t), y_\mu) + \eta_\theta(t). \tag{3}$$



III. THE CAVITY METHOD



Our theory is the dynamical version of the cavity method [16,21,22]. It uses a self-consistency argument to consider
what happens when a new example is added to a training set. The central quantity in this method is the cavity
activation, which is the activation of a new example for a perceptron trained without that example. Since the original
network has no information about the new example, the cavity activation is random. Here we present the theory for
$\theta = \phi = 0$, skipping extensions to biased perceptrons. Denoting the new example by the label 0, its cavity activation
at time $t$ is $h_0(t) = \vec{J}(t)\cdot\vec{\xi}^0$. For large $N$, $h_0(t)$ is a Gaussian variable. Its covariance is given by the correlation
function $C(t,s)$ of the weights at times $t$ and $s$, that is, $\langle h_0(t)h_0(s)\rangle = \vec{J}(t)\cdot\vec{J}(s) = C(t,s)$, where $\xi_j^0$ and $\xi_k^0$ are
assumed to be independent for $j \neq k$. For teacher-generated examples, the distribution is further specified by the
teacher-student correlation $R(t)$, given by $\langle h_0(t)y_0\rangle = \vec{J}(t)\cdot\vec{B} = R(t)$.

Now suppose the perceptron incorporates the new example at the batch-mode learning step at time $s$. Then the
activation of this new example at a subsequent time $t > s$ will no longer be a random variable. Furthermore, the
activations of the original $p$ examples at time $t$ will also be adjusted from $\{x_\mu(t)\}$ to $\{x_\mu^0(t)\}$ because of the newcomer,
which will in turn affect the evolution of the activation of example 0, giving rise to the so-called Onsager reaction
effects. This makes the dynamics complex, but fortunately for large $p \sim N$, we can assume that the adjustment from
$x_\mu(t)$ to $x_\mu^0(t)$ is small, and linear response theory can be applied.

Suppose the weights of the original and new perceptron at time $t$ are $\{J_j(t)\}$ and $\{J_j^0(t)\}$ respectively. Then a
perturbation of (1) yields

$$\left(\frac{d}{dt} + \lambda\right)\left(J_j^0(t) - J_j(t)\right) = \frac{1}{N}\,g'(x_0(t), y_0)\,\xi_j^0 + \frac{1}{N}\sum_{\mu k}\xi_j^\mu\, g''(x_\mu(t), y_\mu)\,\xi_k^\mu\left(J_k^0(t) - J_k(t)\right). \tag{4}$$

The first term on the right hand side describes the primary effects of adding example 0 to the training set, and is
the driving term for the difference between the two perceptrons. The second term describes the many-body reactions
due to the changes of the original examples caused by the added example, and is referred to as the Onsager reaction
term. One should note the difference between the cavity and generic activations of the added example. The former is
denoted by $h_0(t)$ and corresponds to the activation in the perceptron $\{J_j(t)\}$, whereas the latter, denoted by $x_0(t)$ and
corresponding to the activation in the perceptron $\{J_j^0(t)\}$, is the one used in calculating the gradient in the driving
term of (4). Since their notations are sufficiently distinct, we have omitted the superscript 0 in $x_0(t)$, which appears
in the background examples $x_\mu^0(t)$.

The equation can be solved by the Green's function technique, yielding

$$J_j^0(t) - J_j(t) = \sum_k\int ds\, G_{jk}(t,s)\left(\frac{1}{N}\,g'_0(s)\,\xi_k^0\right), \tag{5}$$

where $g'_0(s) = g'(x_0(s), y_0)$ and $G_{jk}(t,s)$ is the weight Green's function, which describes how the effects of a
perturbation propagate from weight $J_k$ at learning time $s$ to weight $J_j$ at a subsequent time $t$. In the present context, the
perturbation comes from the gradient term of example 0, such that integrating over the history and summing over all
nodes give the resultant change from $J_j(t)$ to $J_j^0(t)$.

For large $N$ the weight Green's function can be found by the diagrammatic approach explained in Appendix A. The
result is self-averaging over the distribution of examples and is diagonal, i.e. $\lim_{N\to\infty} G_{jk}(t,s) = G(t,s)\,\delta_{jk}$, where

$$G(t,s) = G^{(0)}(t-s) + \alpha\int dt_1\int dt_2\, G^{(0)}(t-t_1)\,\langle D_\mu(t_1,t_2)\, g''_\mu(t_2)\rangle\, G(t_2,s). \tag{6}$$

Here the bare Green's function $G^{(0)}(t-s)$ is given by

$$G^{(0)}(t-s) = \Theta(t-s)\exp(-\lambda(t-s)). \tag{7}$$

$\Theta$ is the step function. $D_\mu(t,s)$ is the example Green's function given by

$$D_\mu(t,s) = \delta(t-s) + \int dt'\, D_\mu(t,t')\, g''_\mu(t')\, G(t',s). \tag{8}$$



Our key to the macroscopic description of the learning dynamics is to relate the activations of the examples to their
cavity counterparts, which are known to be Gaussian. Multiplying both sides of (5) by $\xi_j^0$ and summing over $j$, we have

$$x_0(t) - h_0(t) = \int ds\, G(t,s)\, g'_0(s). \tag{9}$$

In turn, the covariance of the cavity activation distribution is provided by the fluctuation-response relation explained
in Appendix B,

$$C(t,s) = \alpha\int dt'\, G^{(0)}(t-t')\,\langle g'_\mu(t')\, x_\mu(s)\rangle + 2T\int dt'\, G^{(0)}(t-t')\, G(s,t'). \tag{10}$$

Furthermore, for teacher-generated examples, its mean is related to the teacher-student correlation given by

$$R(t) = \alpha\int dt'\, G^{(0)}(t-t')\,\langle g'_\mu(t')\, y_\mu\rangle. \tag{11}$$



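For a given teacher activation $y$ of a trained example, the distribution for a set of student activations $\{x(t)\}$ of the
same example at different times is, in the limit of infinitesimal time steps $\Delta t$, given by

$$p(\{x(t)\}|y) = \int\prod_t dh(t)\,\frac{\exp\left\{-\frac{1}{2}\sum_{t,s}[h(t) - R(t)y]\,C(t,s)^{-1}\,[h(s) - R(s)y]\right\}}{\sqrt{\det(2\pi C)}}\,\prod_t\delta\left[x(t) - h(t) - \Delta t\sum_s G(t,s)\,g'(x(s))\right]. \tag{12}$$

This can be written in an integral form which is often derived from path integral approaches,

$$p(\{x(t)\}|y) = \int\prod_t\frac{d\hat{h}(t)\,dh(t)}{2\pi}\,\exp\left\{i\int dt\,\hat{h}(t)[h(t) - R(t)y] - \frac{1}{2}\int dt\int ds\,\hat{h}(t)\,C(t,s)\,\hat{h}(s)\right\}\prod_t\delta\left[x(t) - h(t) - \Delta t\sum_s G(t,s)\,g'(x(s))\right]. \tag{13}$$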



The above distributions and parameters are sufficient to describe the progress of learning. Some common performance 
measures used for such monitoring purpose include: 

(a) Training error $\epsilon_t$, which is the probability of error for the training examples, and can be determined from the
distribution $p(x|y)$ that the student activation of a trained example takes the value $x$ for a given teacher activation $y$
of the same example.

(b) Test error $\epsilon_{\rm test}$, which is the probability of error when the inputs $\xi_j^\mu$ of the training examples are corrupted by
an additive Gaussian noise of variance $\Delta^2$. This is a relevant performance measure when the perceptron is applied
to process data which are the corrupted versions of the training data. When $\Delta^2 = 0$, the test error reduces to the
training error. Again, it can be determined from $p(x|y)$, since the noise merely adds a variance of $\Delta^2 C(t,t)$ to the
activations.

(c) Generalization error $\epsilon_g$ for teacher-generated examples, which is the probability of error for an arbitrary input
$\xi_j$ when the teacher and student outputs are compared. It can be determined from $R(t)$ and $C(t,t)$ since, for an
example with teacher activation $y$, the corresponding student activation is a Gaussian with mean $R(t)y$ and variance
$C(t,t)$.
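As a hedged illustration of item (c), the generalization error has a closed form in one common special case. The sketch below assumes sign outputs for both teacher and student, zero biases, and a normalized teacher ($\vec{B}\cdot\vec{B} = 1$), so that the student and teacher activations of a new input are jointly Gaussian with $\langle xy\rangle = R$, $\langle x^2\rangle = C$ and $\langle y^2\rangle = 1$. The function name is hypothetical, and other output functions $f$ would need a different formula.

```python
import numpy as np

def generalization_error(R, C):
    """P(sgn x != sgn y) for zero-mean jointly Gaussian (x, y) with <xy> = R, <xx> = C, <yy> = 1.

    Assumes sign outputs and a normalized teacher; this is the standard
    arccos formula for the angle between student and teacher weight vectors.
    """
    overlap = np.clip(R / np.sqrt(C), -1.0, 1.0)  # guard against rounding just outside [-1, 1]
    return np.arccos(overlap) / np.pi

# An uncorrelated student guesses at chance; a perfectly aligned one makes no errors.
print(generalization_error(0.0, 1.0))
print(generalization_error(1.0, 1.0))
```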



IV. SIMULATION RESULTS 

The success of the cavity approach is illustrated by the many results presented previously for the Adaline rule [14,15].
This is a common learning rule and bears resemblance to the more common back-propagation rule. Theoretically,
its dynamics is particularly convenient for analysis since $g''(x) = -1$, rendering the weight Green's function time
translation invariant, i.e. $G(t,s) = G(t-s)$. In this case, the dynamics can be solved by Laplace transform.
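To make the Laplace-transform remark concrete, one can insert $g'' = -1$ into the transformed equations for the weight Green's function $G$ and the example Green's function $D_\mu$ of Section III. The following is a sketch under the assumptions that both transforms exist and that the bare propagator transforms to $G^{(0)}(z) = (z+\lambda)^{-1}$; the choice of root is our assumption, fixed by requiring $G(z) \to 1/z$ at large $z$.

```latex
\begin{align}
D(z) &= 1 - D(z)\,G(z) \quad\Longrightarrow\quad D(z) = \frac{1}{1 + G(z)}, \\
G(z) &= \frac{1}{z+\lambda}\left[1 - \frac{\alpha\,G(z)}{1 + G(z)}\right], \\
0 &= (z+\lambda)\,G(z)^2 + (z+\lambda+\alpha-1)\,G(z) - 1, \\
G(z) &= \frac{-(z+\lambda+\alpha-1) + \sqrt{(z+\lambda+\alpha-1)^2 + 4(z+\lambda)}}{2(z+\lambda)}.
\end{align}
```

In this form the Green's function depends on the training set only through $\alpha$ and $\lambda$, which is what makes the Adaline case a convenient benchmark.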

The closed form of the Laplace solution for Adaline learning enables us to examine a number of interesting
phenomena in learning dynamics. For example, an overtraining with respect to the generalization error $\epsilon_g$ occurs when
the weight decay is not sufficiently strong, i.e., $\epsilon_g$ attains a minimum at a finite learning time before reaching a
higher steady-state value. Overtraining of the test error $\epsilon_{\rm test}$ also sets in at a sufficiently weak weight decay, which is
approximately proportional to the noise variance $\Delta^2$. We also observe an equivalence between average dynamics and
noiseless dynamics, namely that a perceptron constructed using the thermally averaged weights is equivalent to the
perceptron obtained at a zero noise temperature. All these results are well confirmed by simulations.

Rather than further repeating previous results, we turn to present results which provide more direct support to the
cavity method. In the simulational experiment in Fig. 1 we compare the evolution of two perceptrons $\{J_j(t)\}$ and
$\{J_j^0(t)\}$ in Adaline learning. At the initial state $J_j^0(0) - J_j(0) = 1/N$ for all $j$, but otherwise their subsequent learning
dynamics are exactly identical. Hence the total sum $\sum_j (J_j^0(t) - J_j(t))$ provides an estimate for the averaged Green's
function $G(t,0)$, which is in excellent agreement with the Green's function obtained from the cavity method.
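The twin-perceptron experiment described above can be sketched in a few lines. This is an illustrative reconstruction rather than the authors' code: it assumes a forward-Euler discretization with a hypothetical step `dt`, $T = 0$, and random binary outputs (for Adaline the difference between the two runs does not depend on the outputs). The function name and default parameters are assumptions.

```python
import numpy as np

def green_function_from_twin_runs(N=200, alpha=1.2, lam=0.1, dt=0.01, steps=400, seed=0):
    """Estimate G(t, 0) for Adaline learning from two runs whose initial weights differ by 1/N."""
    rng = np.random.default_rng(seed)
    p = int(alpha * N)
    xi = rng.standard_normal((p, N))       # shared training inputs
    S = rng.choice([-1.0, 1.0], size=p)    # hypothetical random binary outputs
    J = np.zeros(N)                        # reference perceptron
    J0 = J + 1.0 / N                       # perturbed perceptron: J0_j(0) - J_j(0) = 1/N
    G = np.empty(steps + 1)
    G[0] = np.sum(J0 - J)
    for n in range(steps):
        # Identical noiseless Adaline dynamics for both perceptrons (g' = S - x).
        J = J + dt * (xi.T @ (S - xi @ J) / N - lam * J)
        J0 = J0 + dt * (xi.T @ (S - xi @ J0) / N - lam * J0)
        G[n + 1] = np.sum(J0 - J)          # sum_j (J0_j - J_j) estimates G(t, 0)
    return dt * np.arange(steps + 1), G

times, G_est = green_function_from_twin_runs()
```

The estimate starts at $G(0,0) = 1$ and decays with time; plotting `G_est` against `times` for several values of `lam` would give curves of the kind compared with theory in Fig. 1.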

Using the Green's function computed from Fig. 1 we can deduce the cavity activation for each example by measuring
their generic counterpart from the simulation and substituting back into Eq. (9). As shown in the histogram in Fig.
2(a), the cavity activation distribution agrees well with the Gaussian distribution predicted by the cavity method,
with the predicted mean 0 and variance $C(t,t)$.

Similarly, we show in Fig. 2(b) the distribution of $h\,{\rm sgn}\,y$, i.e., the cavity activation in the direction of the correct
teacher output. The cavity method predicts a Gaussian distribution with mean $\sqrt{2/\pi}\,R(t)$ and variance
$C(t,t) - 2R(t)^2/\pi$. Again, it agrees well with the histogram obtained from simulation.
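The quoted mean and variance of $h\,{\rm sgn}\,y$ can be checked by direct sampling. The sketch below assumes that $y$ is a unit Gaussian and that, given $y$, the cavity activation $h$ is Gaussian with mean $Ry$ and variance $C - R^2$ (the conditional form used in the steady-state analysis); the function names and the values of $R$ and $C$ are illustrative only.

```python
import numpy as np

def hsgn_moments(R, C):
    """Mean and variance of h*sgn(y) when <hy> = R, <hh> = C and y is a unit Gaussian."""
    mean = np.sqrt(2.0 / np.pi) * R      # R * <|y|>
    var = C - 2.0 * R ** 2 / np.pi       # <h^2> minus the squared mean
    return mean, var

def sample_hsgn(R, C, n, rng):
    """Draw n samples of h*sgn(y) under the assumed joint Gaussian model."""
    y = rng.standard_normal(n)
    h = R * y + np.sqrt(C - R ** 2) * rng.standard_normal(n)
    return h * np.sign(y)

rng = np.random.default_rng(1)
samples = sample_hsgn(R=0.5, C=0.5, n=200_000, rng=rng)
predicted_mean, predicted_var = hsgn_moments(0.5, 0.5)
```

Comparing `samples.mean()` and `samples.var()` with the predicted values gives a quick consistency check of the two moments.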






FIG. 1. The Green's function $G(t,0)$ for Adaline learning at a given training set size $\alpha = 1.2$ and $T = 0$ for different weight
decay strengths $\lambda$. Theory: solid line, simulation: symbols.




FIG. 2. (a) The cavity activation distribution $h$ for Adaline learning at $\alpha = 1.2$, $\lambda = 0.1$, $T = 0$ and $t = 2$. Theory: dashed
line, with mean 0 and variance 0.499, simulation: histogram, with mean 0.000 and variance 0.499. (b) The distribution of
$h\,{\rm sgn}\,y$. Theory: solid line, with mean 0.413 and variance 0.329, simulation: histogram, with mean 0.416 and variance 0.326.



V. STEADY-STATE BEHAVIOR



When learning reaches a steady state at $T = 0$, the cavity and generic activations approach a constant. Hence Eq.
(9) reduces to

$$x_0 - h_0 = \gamma\, g'(x_0); \quad \gamma = \int ds\, G(t,s), \tag{14}$$

where $\gamma$ is called the local susceptibility. Hence $x_0$ is a well-defined function of $h_0$.

Eq. (14) can also be obtained by minimizing the change in the steady-state energy function when example 0 is
added, which is $-g(x_0) + (x_0 - h_0)^2/2\gamma$, the second term being due to the reaction effects of the background examples.
This was shown previously for the case of a constant weight magnitude, but the same could be shown for the case of a
constant weight decay.

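A self-consistent expression for $\gamma$ can be derived from the steady-state behavior of the Green's function. Since
the system becomes translational invariant in time at the steady state, Eqs. (6) and (8) can be solved by Laplace
transform, yielding

$$G(z) = G^{(0)}(z) + \alpha\, G^{(0)}(z)\,\langle D_\mu(z)\, g''_\mu\rangle\, G(z), \tag{15}$$
$$D_\mu(z) = 1 + D_\mu(z)\, g''_\mu\, G(z), \tag{16}$$

with $G^{(0)}(z) = (z + \lambda)^{-1}$. Identifying $G(0)$ with $\gamma$, we obtain

$$\gamma = \frac{1}{\lambda}\left(1 + \alpha\left\langle\frac{\gamma\, g''_\mu}{1 - \gamma\, g''_\mu}\right\rangle\right). \tag{17}$$

Making use of the functional relation between $x_\mu$ and $h_\mu$, we have

$$\gamma = \frac{1}{\lambda}(1 - \alpha\chi); \quad \chi = \left\langle 1 - \frac{\partial x_\mu}{\partial h_\mu}\right\rangle, \tag{18}$$

where $\chi$ is called the nonlocal susceptibility.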
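For the Adaline rule the two susceptibilities can be obtained in closed form, which gives a concrete check of the self-consistency. The sketch below assumes the relations $\gamma = (1 - \alpha\chi)/\lambda$ and $\chi = \langle 1 - \partial x/\partial h\rangle$ together with $x - h = \gamma\,g'(x)$ and $g'(x) = S - x$, so that $\partial x/\partial h = 1/(1+\gamma)$ for every example; the function name is hypothetical.

```python
import math

def adaline_susceptibilities(alpha, lam):
    """Solve gamma = (1 - alpha*chi)/lam with chi = gamma/(1 + gamma), valid for Adaline only.

    Eliminating chi gives lam*gamma^2 + (lam + alpha - 1)*gamma - 1 = 0;
    the positive root is taken since gamma is a time-integrated response.
    """
    b = lam + alpha - 1.0
    gamma = (-b + math.sqrt(b * b + 4.0 * lam)) / (2.0 * lam)
    chi = gamma / (1.0 + gamma)
    return gamma, chi

gamma, chi = adaline_susceptibilities(alpha=1.2, lam=0.1)
print(gamma, chi)  # approximately 2 and 2/3 for these parameters
```

For $\alpha = 1.2$ and $\lambda = 0.1$, the parameters used in Fig. 2, the quadratic factorizes and gives $\gamma = 2$, $\chi = 2/3$, which indeed satisfies $(1 - \alpha\chi)/\lambda = 2$.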

At the steady state, the fluctuation response relations in Eqs. (10) and (11) yield the self-consistent equations for
the student-student and teacher-student correlations, $C = \vec{J}\cdot\vec{J}$ and $R = \vec{J}\cdot\vec{B}$ respectively, namely

$$C = \frac{\alpha}{\lambda}\langle g'_\mu x_\mu\rangle; \quad R = \frac{\alpha}{\lambda}\langle g'_\mu y_\mu\rangle. \tag{19}$$

Substituting Eqs. (14) and (18), and introducing the cavity activation distributions, we find

$$C = (1 - \alpha\chi)^{-1}\,\alpha\int Dy\int dh\, P(h|y)\,(x(h) - h)\,x(h), \tag{20}$$
$$R = (1 - \alpha\chi)^{-1}\,\alpha\int Dy\int dh\, P(h|y)\,(x(h) - h)\,y. \tag{21}$$



Since $P(h|y)$ is a Gaussian distribution with mean $Ry$ and variance $C - R^2$, its derivatives with respect to $h$ and $y$
are $-(h - Ry)P(h|y)/(C - R^2)$ and $R(h - Ry)P(h|y)/(C - R^2)$ respectively. This enables us to use integration by
parts and Eq. (18) for $\chi$ to obtain

$$C = \alpha\int Dy\int dh\, P(h|y)\,(x(h) - h)^2, \tag{22}$$
$$R = \alpha\gamma\int Dy\int dh\, P(h|y)\,\frac{\partial x}{\partial h}\, g_{xy}. \tag{23}$$



Hence we have recovered the macroscopic parameters described by the static version of the cavity method in [21] by
considering the steady-state behavior of the learning dynamics. We remark that the saddle point equations in the
replica method also produce identical results, although the physical interpretation is less transparent.

We can further derive the microscopic equations by noting that at equilibrium for $T = 0$, Eq. (1) yields

$$J_j = \frac{1}{\lambda N}\sum_\mu g'_\mu\,\xi_j^\mu, \tag{24}$$

which leads to the set of equations

$$x_\nu = \frac{1}{\lambda}\sum_\mu g'_\mu\, Q_{\mu\nu}; \quad Q_{\mu\nu} \equiv \frac{1}{N}\sum_j \xi_j^\mu\xi_j^\nu. \tag{25}$$

The TAP equations are obtained by expressing these equations in terms of the cavity activations via Eq. (14),

$$h_\nu = \sum_{\mu\neq\nu}\left(x(h_\mu) - h_\mu\right)Q_{\mu\nu} + \alpha\chi\, x(h_\nu). \tag{26}$$

The iterative solution of the equation set was applied to the maximally stable perceptron, which yielded excellent
agreement with the cavity method, provided that the stability condition discussed below is satisfied [21]. However,
the agreement is poorer when applied to the committee tree [22] and the pruned perceptron [24], where the stability
condition is not satisfied.

To study the stability condition of the cavity solution, we consider the change in the steady-state solution when
example 0 is added to the training set. Consider the magnitude of the displaced weight vector $\Delta = \sum_j (J_j^0 - J_j)^2$.
Using either the static or dynamic version of the cavity method, we can show that

$$\Delta = \frac{1}{N}\,\frac{(x_0 - h_0)^2}{1 - \alpha\left\langle\left(1 - \frac{\partial x_\mu}{\partial h_\mu}\right)^2\right\rangle}. \tag{27}$$

In order that the change due to the added example is controllable, the stability condition is thus

$$\alpha\left\langle\left(1 - \frac{\partial x_\mu}{\partial h_\mu}\right)^2\right\rangle < 1. \tag{28}$$

This is identical to the stability condition of the replica-symmetric ansatz in the replica method, the so-called
Almeida-Thouless condition [31].

As a corollary, when a band gap exists in the activation distribution, the stability condition is violated. This is
because the function $x(h)$ becomes discontinuous in this case, implying the presence of a delta-function component
in $\partial x/\partial h$.
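The stability condition is straightforward to evaluate once the slopes $\partial x_\mu/\partial h_\mu$ are known. The sketch below is an illustration only: the function name is hypothetical, and the Adaline slope $1/(1+\gamma)$ with $\gamma = 2$ is taken from the steady-state relation $x - h = \gamma(S - x)$ at $\alpha = 1.2$, $\lambda = 0.1$.

```python
import numpy as np

def stability_margin(alpha, dx_dh):
    """Return alpha * <(1 - dx/dh)^2>; the cavity solution is stable when the result is below 1."""
    dx_dh = np.asarray(dx_dh, dtype=float)
    return alpha * np.mean((1.0 - dx_dh) ** 2)

# Adaline: every example has the same slope dx/dh = 1/(1 + gamma).
gamma = 2.0
margin = stability_margin(1.2, np.full(1000, 1.0 / (1.0 + gamma)))
print(margin < 1.0)
```

For Adaline the margin equals $\alpha\chi^2$, which stays below 1 for these parameters, consistent with a smooth energy landscape. A discontinuous $x(h)$ would contribute arbitrarily large values of $(1 - \partial x/\partial h)^2$ near the jump, which is the mechanism behind the band-gap corollary.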

Such is the case in the nonlinear perceptron trained with noisy examples using the backpropagation algorithm.
For insufficient examples and weak weight decay, the activation distribution exhibits a gap for the more difficult
examples, i.e., when the teacher output $y$ and the cavity activation $h$ have a large difference. As shown in Fig.
3(a), simulational and theoretical predictions of the activation distributions agree well in the stable regime, but the
agreement is poor in the unstable regime shown in Fig. 3(b). Hence the existence of band gaps necessitates the
picture of a rough energy landscape, as described in the following section.




FIG. 3. Typical student activation distributions at $\alpha = 3$ and $\lambda = 0.002$, (a) in the stable regime in which the teacher
activations are corrupted by noises of variance 0.1, (b) in the unstable regime in which the teacher activations are corrupted
by noises of variance 5.



VI. THE PICTURE OF ROUGH ENERGY LANDSCAPES



To consider what happens beyond the stability regime, one has to take into account the rough energy landscape of
the learning space. To keep the explanation simple, we consider the learning of examples generated randomly, the case
of teacher-generated examples being similar though more complicated. Suppose that the original global minimum
for a given training set is $\alpha$. In the picture of a smooth energy landscape, the network state shifts perturbatively
after adding example 0, as schematically shown in Fig. 4(a). In contrast, in the picture of a rough energy landscape,
a nonvanishing change to the system is induced, and the global minimum shifts to the neighborhood of the local
minimum $\beta$, as schematically shown in Fig. 4(b). Hence the resultant activation $x_0^\beta$ is no longer a well-defined
function of the cavity activation $h_0^\alpha$. Instead it is a well-defined function of the cavity activation $h_0^\beta$. Nevertheless,
one may expect that correlations exist between the states $\alpha$ and $\beta$.




FIG. 4. Schematic drawing of the change in the energy landscape in the weight space when example 0 is added, for the
regime of (a) smooth energy landscape, (b) rough energy landscape.

Let $q_0$ be the correlation between two local minima labelled by $\beta$ and $\gamma$, i.e. $\vec{J}^\beta\cdot\vec{J}^\gamma = q_0$. Both of them are centred
about the global minimum $\alpha$, so that $\vec{J}^\alpha\cdot\vec{J}^\beta = \vec{J}^\alpha\cdot\vec{J}^\gamma = \sqrt{q_0 q_1}$, where $q_1 = \vec{J}^\alpha\cdot\vec{J}^\alpha = \vec{J}^\beta\cdot\vec{J}^\beta = \vec{J}^\gamma\cdot\vec{J}^\gamma$. Since
both states $\alpha$ and $\beta$ are determined in the absence of the added example 0, the correlation $\langle h_0^\alpha h_0^\beta\rangle = \sqrt{q_0 q_1}$ as well.

Knowing that both $h_0^\alpha$ and $h_0^\beta$ obey Gaussian distributions, the cavity activation distribution can be determined if
we know the prior distribution of the local minima.

At this point we introduce the central assumption in the cavity method for rough energy landscapes: we assume
that the number of local minima at energy $E$ obeys an exponential distribution

$$d\mathcal{N}(E) \propto \exp(-wE)\,dE. \tag{29}$$

Similar assumptions have been used in specifying the density of states in disordered systems [31]. Thus the cavity
activation distribution is given by

$$P(h_0^\beta|h_0^\alpha) = \frac{G(h_0^\beta|h_0^\alpha)\exp[-w\Delta E(x(h_0^\beta))]}{\int dh_0^\beta\, G(h_0^\beta|h_0^\alpha)\exp[-w\Delta E(x(h_0^\beta))]}, \tag{30}$$

where $G(h_0^\beta|h_0^\alpha)$ is a Gaussian distribution with mean $\sqrt{q_0/q_1}\,h_0^\alpha$ and variance $q_1 - q_0$. $\Delta E$ is the change in energy
due to the addition of example 0, and is equal to $-g(x_0^\beta) + (x_0^\beta - h_0^\beta)^2/2\gamma$. The weights $J_j^\beta$ are given by

$$J_j^\beta = \frac{1}{\lambda N}\sum_\mu g'(x_\mu^\beta)\,\xi_j^\mu. \tag{31}$$
Self-consistent equations for the macroscopic parameters are derived in Appendix C. The results are identical to the 
first step replica symmetry-breaking solution in the replica method. 
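The reweighting of the Gaussian cavity distribution by the exponential density of states can be evaluated numerically on a grid. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the grid is uniform, and the energy change uses the Adaline form, for which $x(h) = (h + \gamma S)/(1 + \gamma)$ gives $\Delta E = (S - h)^2/[2(1+\gamma)]$. Adaline itself has a smooth landscape, so it serves here only as a convenient analytic test case.

```python
import numpy as np

def cavity_distribution(h_beta, h_alpha, q0, q1, w, delta_E):
    """Normalized density proportional to G(h_beta | h_alpha) * exp(-w * delta_E(h_beta)) on a uniform grid."""
    mean = np.sqrt(q0 / q1) * h_alpha          # conditional mean of h_beta given h_alpha
    var = q1 - q0                              # conditional variance
    log_weight = -0.5 * (h_beta - mean) ** 2 / var - w * delta_E(h_beta)
    weight = np.exp(log_weight - log_weight.max())   # subtract the maximum for numerical stability
    dh = h_beta[1] - h_beta[0]
    return weight / (weight.sum() * dh)

def adaline_delta_E(h, S=1.0, gamma=2.0):
    """Energy change for Adaline (assumed example): (S - h)^2 / (2 * (1 + gamma))."""
    return (S - h) ** 2 / (2.0 * (1.0 + gamma))

grid = np.linspace(-10.0, 10.0, 4001)
dh = grid[1] - grid[0]
P_flat = cavity_distribution(grid, 0.5, q0=0.3, q1=0.6, w=0.0, delta_E=adaline_delta_E)
P_tilt = cavity_distribution(grid, 0.5, q0=0.3, q1=0.6, w=2.0, delta_E=adaline_delta_E)
mean_flat = np.sum(grid * P_flat) * dh
mean_tilt = np.sum(grid * P_tilt) * dh
```

With $w = 0$ the Gaussian $G(h_0^\beta|h_0^\alpha)$ is recovered, while $w > 0$ shifts weight toward cavity activations with a lower energy cost, here toward the target output $S$.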

It remains to check whether the microscopic equations have been modified due to the roughening of the energy
landscape. In terms of the generic activations, the microscopic equations are identical to Eq. (25) for each local
minimum. In terms of the cavity activations, the TAP equations are again identical to Eq. (26), except that the
nonlocal susceptibility $\chi$ is now evaluated in the corresponding local minimum. The cavity activation distribution is
no longer a Gaussian distribution, but is modified by the density of states in Eq. (29) now. Hence the values of $\chi$
and $\gamma$ appearing in the TAP equations are no longer identical to the case of restricting learning to a single valley.



VII. CONCLUSION 



In summary, we have introduced a general framework for modeling the dynamics of learning based on the cavity 
method, which is applicable to general learning cost functions, though its tractable solutions are not generally available. 

We have verified its validity by simulations of the cavity activation distributions. The steady-state behavior is seen 
to be consistent with the static version of the cavity method in the picture of smooth energy landscapes, which is 
equivalent to the replica symmetric ansatz in the replica method. This picture is based on the assumption that the 
dynamics is stable against perturbations, and is manifested in a stability condition equivalent to the Almeida-Thouless 
condition in the replica method. Beyond the stability regime, rough energy landscapes have to be introduced, but 
the microscopic TAP equations remain valid. 

There are two interesting issues concerning the extension of the present work. First, it is interesting to consider how
the dynamics is modified in the picture of rough energy landscapes. In this case, aging effects may appear, and the
dynamics may not be translationally invariant in time [33]. Second, it is interesting to consider whether the analysis
remains tractable for nonlinear learning rules. In general, $D_\mu(t,s)$ in (8) has to be expanded as a series. Nevertheless,
we have shown that the asymptotic dynamics remains tractable for nonlinear learning rules. For transient dynamics,
we may need to consider appropriate approximations. Another applicable area is the case of batch learning with very
large learning steps, whose analysis remains simple due to its fast convergence. The method can also be applied
to on-line learning of restricted sets of examples.

An alternative general theory for learning dynamics is the dynamical replica theory. It yields exact results for
Hebbian learning, but for less trivial cases, the analysis is approximate and complicated by the need to solve replica
saddle point equations at every learning instant. It is hoped that by adhering to an exact formalism, the cavity
method can provide more fundamental insights when extended to multilayer networks.

We thank A. C. C. Coolen and D. Saad for fruitful discussions. This work was supported by the Research Grant 
Council of Hong Kong (HKUST6130/97P and HKUST6157/99P). 



APPENDIX A: THE GREEN'S FUNCTION 

Substituting Eq. (5) into Eq. (4), we see that the Green's function satisfies

$$\left(\frac{d}{dt} + \lambda\right)G_{jk}(t,s) = \delta(t-s)\,\delta_{jk} + \frac{1}{N}\sum_{\mu i}\xi_j^\mu\, g''_\mu(t)\,\xi_i^\mu\, G_{ik}(t,s). \tag{A1}$$

Introducing the bare Green's function $G^{(0)}(t-s)$ in Eq. (7),

$$G_{jk}(t,s) = G^{(0)}(t-s)\,\delta_{jk} + \frac{1}{N}\sum_{\mu i}\int dt'\, G^{(0)}(t-t')\,\xi_j^\mu\, g''_\mu(t')\,\xi_i^\mu\, G_{ik}(t',s). \tag{A2}$$

This equation is represented diagrammatically in Fig. 5(a). We use a slanted line to represent an example bit, the top
and bottom ends of the line corresponding to the example label and node label respectively. A filled circle represents
the vertex $g''_\mu(t)$. Thin and thick lines represent the bare and dressed Green's functions $G^{(0)}(t-s)$ and $G(t,s)$ respectively.
The iterative solution to Eq. (A2) can be represented by the series of diagrams in Fig. 5(b). It is convenient to
concurrently introduce the example Green's function $D_\mu(t,s)$ as shown in Fig. 5(c).

The average over the distribution of example inputs is done by pairing of example or node labels, and is represented
by dashed lines connecting the vertices above or below the solid lines. Pairing of example and node labels yields factors
of 1 and $\alpha$ respectively. Noting that crossing diagrams do not contribute [5], the two Green's functions can be expressed
in terms of the self-energies $\Sigma$ and $\Pi_\mu$, via the Dyson's equations in Fig. 5(d). The self-energies are defined in Fig.
5(e), and are characterized by having the first node or example paired with the last one only. The self-energies can
in turn be expressed in terms of the Green's functions as in Fig. 5(f), thus allowing for self-consistent solutions.

After eliminating the self-energies, the results of the diagrammatic analysis are given by Eqs. (||) and @. 

FIG. 5. (a) Diagrammatic representation of Eq. (A2); (b) iterative solution to Eq. (A2); (c) the example Green's function; (d) Dyson's equations; (e) the self-energies; (f) the self-energies in terms of the Green's functions.



APPENDIX B: THE FLUCTUATION RESPONSE RELATION 

In terms of the bare Green's function, the solution to the dynamical equation Eq. (Q) is 
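
$$J_j(t) = \frac{1}{N}\sum_\mu \int dt'\, G^{(0)}(t-t')\, g_\mu(t')\, \xi_j^\mu + \int dt'\, G^{(0)}(t-t')\, \eta_j(t'). \tag{B1}$$

Multiplying both sides by $J_j(s)$ and summing over $j$, we have

$$C(t,s) = \alpha \int dt'\, G^{(0)}(t-t')\, \langle g_\mu(t')\, x_\mu(s) \rangle + \int dt'\, G^{(0)}(t-t') \sum_j \langle J_j(s)\, \eta_j(t') \rangle. \tag{B2}$$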



The correlation between $J_j(s)$ and $\eta_j(t')$ can be considered by comparing the learning process with another one which is noiseless between $t' - \epsilon$ and $t' + \epsilon$, but is otherwise identical. Denoting the weight of this alternative process by $J_j^{\backslash t'}(s)$, we have

$$J_j(s) = J_j^{\backslash t'}(s) + \int_{t'-\epsilon}^{t'+\epsilon} dt''\, G(s,t'')\, \eta_j(t''). \tag{B3}$$



Noting that the weight $J_j^{\backslash t'}(s)$ of the alternative process is uncorrelated with $\eta_j(t')$, and that $\eta_j(t'')$ has a delta function correlation with $\eta_j(t')$ as in Eq. (Q), we arrive at Eq. (10).

Similarly, multiplying both sides by $B_j$ and summing over $j$, we arrive at Eq. (11).
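
The last step of the argument for the noise term can be spelled out as follows. This is only a sketch: it assumes a white-noise correlation of the form $\langle \eta_j(t)\eta_k(s) \rangle = (2T/N)\,\delta_{jk}\,\delta(t-s)$, where the normalization and the symbol $T$ are assumptions made here because the noise correlation is not reproduced in this appendix. The weight of the alternative, locally noiseless process is written as $J_j^{\backslash t'}(s)$.

```latex
% Assumption: <eta_j(t) eta_k(s)> = (2T/N) delta_{jk} delta(t-s), with <eta_j> = 0.
\begin{align}
\sum_j \langle J_j(s)\,\eta_j(t') \rangle
 &= \sum_j \langle J_j^{\backslash t'}(s)\,\eta_j(t') \rangle
  + \sum_j \int_{t'-\epsilon}^{t'+\epsilon} dt''\, G(s,t'')\,
    \langle \eta_j(t'')\,\eta_j(t') \rangle \\
 &= 0 + \sum_j \frac{2T}{N}\, G(s,t')
  = 2T\, G(s,t') .
\end{align}
```

Under this assumption the noise term of Eq. (B2) becomes $2T \int dt'\, G^{(0)}(t-t')\, G(s,t')$, so the correlation function is tied to the response function, which is the content of a fluctuation response relation.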

APPENDIX C: MACROSCOPIC PARAMETERS IN ROUGH ENERGY LANDSCAPES



From Eq. (|lg|), the nonlocal susceptibility is given by 



X 



dh ^ G{h ^lA h o G ( h o\ h o>- wAE a - d4/dh$) 



;<g(^i^) 



oAE 



where G(Hq) is a Gaussian with mean and variance q%. The local susceptibility 7 is given by 

$$\gamma = \frac{1}{\lambda(1 - \alpha\chi)}.$$

From the fluctuation response relation in Eq. we have 

rdh$G{hg\h%)e-'° AB g'(4)4 



9i 



dhgG(hgy 



fdh%G{j4\h§)t 



Substituting Eqs. (hj) and (18), we find 



(1 - ax)qi = a I wl**^ - 



! dh'„G(h" \h%)e- 



oAE 



(CI) 



(C2) 



(C3) 



(C4) 



The differentiations of G(/iq|/iq) with respect to h$ and fifi introduce factors of — (Hq — v^o/Si^o )/(<?i ~ 1o) an d 
\/qo/qi{hQ — y/qo/qiho) / (gi — go) respectively, and that of G(/Iq) with respect to Hq introduces —h^/qi. This allows 



us to use integration by parts and Eq. (|18[) for \ to obtain 



qi = a 



1 + — (<?i - go) 

7 



|9\2 



uAE 



+a-q I dh«G{h«). 



jdh^G{h^\K)e-^ E {x^-h t 



fdh%G(h%\h%)e 



-wAE 



fdh o G(h$\h$)e-^ E (4-h%) 



Jdh f ^G(h^)e- 



uAE 



(C5) 



Next we derive an equation for the interstate overlap $q_0$. Consider the steady-state solution $J_j$ of a local minimum given by Eq. (14). Multiplying both sides by the weight vector $J_j$ at another local minimum and summing over $j$, we have



(C6) 



Proceeding as in the case of $q_1$, we get



1 + —(91 - 90) 

7 



q a = a 



+a-q j dh%G{h%) 



dh%G{h%) 



fdh$G(h$\h%)e-™ AE (x$-h$) 



j dhlG{hl\h%)e~ 



wAE 



fdh%G(h$\h%)e-^ E (x$-hZf 



jdh^G(h r ^)e- 



uAE 



fdh$G(h$\h%)e-^ B (x$-h%) 



Solving Eqs. (|C|) and 

dh%G{h%: 



1 



fdh o G(h$\h%)e-^ E (x$-h$T 



fdh%G(h%\h%) 



9i + f(?i-<7o) 5 



uAE 



fdh%G(h p \h%)e- 



vAE 



l + f(<Zi-9o) 



dh%G(h%) 



Jdh o G(h^)e~^ E (x^-h^ o ) 



-1 2 



f dh$G(h%\h%)e- 



wAE 



(C7) 



(C8) 
(C9) 

To determine the distribution of local minima, namely the parameter $w$, we introduce a "free energy" $F(p, N)$ for $p$ examples and $N$ input nodes, given by



<M(E) = exp[w(F(p,N) - E)]dE. 



(CIO) 



This "free energy" determines the averaged energy of the local minima and should be an extensive quantity, i.e. it should scale as the system size. Cavity arguments enable us to find an expression for $F(p+1, N) - F(p, N)$. When the number of examples increases by 1, the densities of states for a given $h_0^\alpha$ are related by

$$\mathcal{N}(E_{p+1}, h_0^\alpha) = \int dE_p\, \mathcal{N}(E_p, h_0^\alpha) \int dh_0^\beta\, G(h_0^\beta | h_0^\alpha)\, \delta(E_{p+1} - E_p - \Delta E). \tag{C11}$$

Using Eq. (C10) we obtain, on averaging over $h_0^\alpha$,

$$F(p+1, N) = F(p, N) - \frac{1}{w} \int dh_0^\alpha\, G(h_0^\alpha) \ln \int dh_0^\beta\, G(h_0^\beta | h_0^\alpha)\, e^{-w\Delta E}. \tag{C12}$$



Similarly, we may consider a cavity argument for the addition of one input node, expanding the network size from N 
to N + 1. Skipping the details, the final result is 



F(p,N + l)-F(p,N) = - 



2- 



l + ^(<Zi-9o) 



2w 



i + —{qi - qo) 

7 



A 

2 qi - 



(C13) 



Since $F$ is an extensive quantity, $F(p, N)$ should scale as $N$ for a given ratio $\alpha = p/N$. This implies

$$\frac{F}{N} = \frac{dF}{dN} = [F(p, N+1) - F(p, N)] + \alpha\,[F(p+1, N) - F(p, N)].$$
When E — F, the density of states reduces to O(e ) and the global minimum is reached. Hence 



(C14) 



J dh%G(h%y 



anru a JdhZG(K\K)e-™ AE g(x$) 



qo 



fdh$G(h o \hg)e-^ ~ 2 7 [l + H ((?1 _ qo) 



-J-m 

2w 



i + —(qi - qo) 

7 



^ / dh%G(h%)ln / dh$G(h$\h§)e 



-wAE 



(C15) 



Eqs. (C1), (C2), (|Cq), (JCS|) and (C15) form a set of five equations for $\chi$, $\gamma$, $q_1$, $q_0$ and $w$.
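
The extensivity argument behind Eq. (C14) can be checked numerically on a toy example. The sketch below is not derived from the cavity equations: it assumes a free energy of the extensive form $F(p, N) = N f(p/N)$ with an arbitrary, hypothetical function `f`, and verifies that $F/N$ agrees with the sum of the one-node increment and $\alpha$ times the one-example increment when $N$ is large. All function names are illustrative.

```python
def f(alpha):
    """Hypothetical free energy per node, chosen only for illustration."""
    return -alpha / (1.0 + alpha)


def free_energy(p, n):
    """Extensive free energy F(p, N) = N * f(p / N)."""
    return n * f(p / n)


def extensivity_gap(p, n):
    """Return F/N minus the right-hand side of the extensivity relation."""
    alpha = p / n
    d_node = free_energy(p, n + 1) - free_energy(p, n)     # add one input node
    d_example = free_energy(p + 1, n) - free_energy(p, n)  # add one example
    return free_energy(p, n) / n - (d_node + alpha * d_example)


# The gap should vanish as 1/N for a smooth f.
gap = extensivity_gap(p=1_500_000, n=1_000_000)
print(gap)
```

The finite differences stand in for the partial derivatives of $F$ with respect to $N$ and $p$, so the residual gap is a finite-size correction that decreases as the system size grows.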